Naïve Bayes


Sentiment Analysis On YouTube Comments Using Machine Learning Techniques Based On Video Games Content

Amin, Adi Danish Bin Muhammad, Bhuiyan, Mohaiminul Islam, Kamarudin, Nur Shazwani, Toh, Zulfahmi, Nafis, Nur Syafiqah

arXiv.org Artificial Intelligence

The rapid evolution of the gaming industry, driven by technological advancements and a burgeoning community, necessitates a deeper understanding of user sentiments, especially as expressed on popular social media platforms like YouTube. This study presents a sentiment analysis on video games based on YouTube comments, aiming to understand user sentiments within the gaming community. Using the YouTube API, comments related to various video games were collected and analyzed using the TextBlob sentiment analysis tool. The pre-processed data underwent classification using machine learning algorithms, including Naïve Bayes, Logistic Regression, and Support Vector Machine (SVM). Among these, SVM demonstrated superior performance, achieving the highest classification accuracy across different datasets. The analysis spanned multiple popular gaming videos, revealing trends and insights into user preferences and critiques. The findings underscore the importance of advanced sentiment analysis in capturing the nuanced emotions expressed in user comments, providing valuable feedback for game developers to enhance game design and user experience. Future research will focus on integrating more sophisticated natural language processing techniques and exploring additional data sources to further refine sentiment analysis in the gaming domain.


Performance Analysis of Supervised Machine Learning Algorithms for Text Classification

Mishu, Sadia Zaman, Rafiuddin, S M

arXiv.org Artificial Intelligence

The demand for text classification is growing significantly in web searching, data mining, web ranking, recommendation systems, and so many other fields of information and technology. This paper illustrates the text classification process on different datasets using some standard supervised machine learning techniques. Text documents can be classified through various kinds of classifiers. Labeled text documents are used to classify the text in supervised classifications. This paper applies these classifiers on different kinds of labeled documents and measures the accuracy of the classifiers. An Artificial Neural Network (ANN) model using Back Propagation Network (BPN) is used with several other models to create an independent platform for labeled and supervised text classification process. An existing benchmark approach is used to analyze the performance of classification using labeled documents. Experimental analysis on real data reveals which model works well in terms of classification accuracy.


Detecting Quishing Attacks with Machine Learning Techniques Through QR Code Analysis

Trad, Fouad, Chehab, Ali

arXiv.org Artificial Intelligence

The rise of QR code based phishing ("Quishing") poses a growing cybersecurity threat, as attackers increasingly exploit QR codes to bypass traditional phishing defenses. Existing detection methods predominantly focus on URL analysis, which requires the extraction of the QR code payload, and may inadvertently expose users to malicious content. Moreover, QR codes can encode various types of data beyond URLs, such as Wi-Fi credentials and payment information, making URL-based detection insufficient for broader security concerns. To address these gaps, we propose the first framework for quishing detection that directly analyzes QR code structure and pixel patterns without extracting the embedded content. We generated a dataset of phishing and benign QR codes and we used it to train and evaluate multiple machine learning models, including Logistic Regression, Decision Trees, Random Forest, Naive Bayes, LightGBM, and XGBoost. Our best-performing model (XGBoost) achieves an AUC of 0.9106, demonstrating the feasibility of QR-centric detection. Through feature importance analysis, we identify key visual indicators of malicious intent and refine our feature set by removing non-informative pixels, improving performance to an AUC of 0.9133 with a reduced feature space. Our findings reveal that the structural features of QR code correlate strongly with phishing risk. This work establishes a foundation for quishing mitigation and highlights the potential of direct QR analysis as a critical layer in modern phishing defenses.
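As an illustration of the AUC figures reported in abstracts like the one above, here is a minimal sketch of how an AUC value can be computed from binary labels and classifier scores. It uses the pairwise-ranking definition (the probability that a randomly chosen positive scores higher than a randomly chosen negative). The function name `roc_auc` and the toy labels and scores are assumptions for illustration only and are not taken from any of the listed papers.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy, made-up data: 1 = phishing, 0 = benign.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
auc = roc_auc(labels, scores)
```

The quadratic loop is only suitable for small examples; a rank-based computation would be the usual choice for large datasets.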


Predicting Survivability of Cancer Patients with Metastatic Patterns Using Explainable AI

Nalela, Polycarp, Rao, Deepthi, Rao, Praveen

arXiv.org Artificial Intelligence

Cancer remains a leading global health challenge and a major cause of mortality. This study leverages machine learning (ML) to predict the survivability of cancer patients with metastatic patterns using the comprehensive MSK-MET dataset, which includes genomic and clinical data from 25,775 patients across 27 cancer types. We evaluated five ML models (XGBoost, Naïve Bayes, Decision Tree, Logistic Regression, and Random Forest) using hyperparameter tuning and grid search. XGBoost emerged as the best performer with an area under the curve (AUC) of 0.82. To enhance model interpretability, SHapley Additive exPlanations (SHAP) were applied, revealing key predictors such as metastatic site count, tumor mutation burden, fraction of genome altered, and organ-specific metastases. Further survival analysis using Kaplan-Meier curves, Cox Proportional Hazards models, and XGBoost Survival Analysis identified significant predictors of patient outcomes, offering actionable insights for clinicians. These findings could aid in personalized prognosis and treatment planning, ultimately improving patient care.


Generating Synthetic Oracle Datasets to Analyze Noise Impact: A Study on Building Function Classification Using Tweets

Bai, Shanshan, Kruspe, Anna, Zhu, Xiaoxiang

arXiv.org Artificial Intelligence

Tweets provide valuable semantic context for earth observation tasks and serve as a complementary modality to remote sensing imagery. In building function classification (BFC), tweets are often collected using geographic heuristics and labeled via external databases, an inherently weakly supervised process that introduces both label noise and sentence level feature noise (e.g., irrelevant or uninformative tweets). While label noise has been widely studied, the impact of sentence level feature noise remains underexplored, largely due to the lack of clean benchmark datasets for controlled analysis. In this work, we propose a method for generating a synthetic oracle dataset using an LLM, designed to contain only tweets that are both correctly labeled and semantically relevant to their associated buildings. This oracle dataset enables systematic investigation of noise impacts that are otherwise difficult to isolate in real-world data. To assess its utility, we compare model performance using Naive Bayes and mBERT classifiers under three configurations: real vs. synthetic training data, and cross-domain generalization. Results show that noise in real tweets significantly degrades the contextual learning capacity of mBERT, reducing its performance to that of a simple keyword-based model. In contrast, the clean synthetic dataset allows mBERT to learn effectively, outperforming Naive Bayes by a large margin. These findings highlight that addressing feature noise is more critical than model complexity in this task. Our synthetic dataset offers a novel experimental environment for future noise injection studies and is publicly available on GitHub.


An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification

Rusli, Andre, Shishido, Makoto

arXiv.org Artificial Intelligence

This study investigates the performance of three popular tokenization tools: MeCab, Sudachi, and SentencePiece, when applied as a preprocessing step for sentiment-based text classification of Japanese texts. Using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, we evaluate two traditional machine learning classifiers: Multinomial Naive Bayes and Logistic Regression. The results reveal that Sudachi produces tokens closely aligned with dictionary definitions, while MeCab and SentencePiece demonstrate faster processing speeds. The combination of SentencePiece, TF-IDF, and Logistic Regression outperforms the other alternatives in terms of classification performance.
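For readers unfamiliar with the classifier that recurs throughout this listing, the following is a minimal sketch of Multinomial Naive Bayes with Laplace smoothing. It is not the authors' pipeline: it uses raw token counts instead of TF-IDF weights, whitespace splitting of English text in place of MeCab, Sudachi, or SentencePiece, and a tiny made-up training set. The function names `train_multinomial_nb` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit Multinomial Naive Bayes on pre-tokenized documents (alpha = Laplace smoothing)."""
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocab.update(tokens)

    model = {"log_prior": {}, "log_likelihood": {}, "vocab": vocab}
    for label, n_docs in class_doc_counts.items():
        model["log_prior"][label] = math.log(n_docs / len(labels))
        # Smoothed denominator: all tokens seen in this class plus alpha per vocabulary entry.
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            tok: math.log((token_counts[label][tok] + alpha) / total)
            for tok in vocab
        }
    return model


def predict(model, tokens):
    """Return the class with the highest log-posterior score; unseen tokens are skipped."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        log_lik = model["log_likelihood"][label]
        scores[label] = log_prior + sum(
            log_lik[t] for t in tokens if t in model["vocab"]
        )
    return max(scores, key=scores.get)


# Toy, made-up training data (not from the paper).
train_docs = [
    "great fun game".split(),
    "love this great story".split(),
    "boring slow game".split(),
    "hate this boring story".split(),
]
train_labels = ["pos", "pos", "neg", "neg"]
model = train_multinomial_nb(train_docs, train_labels)
```

Swapping the tokenizer only changes how `train_docs` is produced, which is why tokenizer choice can be compared while holding the classifier fixed.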


Harnessing PU Learning for Enhanced Cloud-based DDoS Detection: A Comparative Analysis

Dilworth, Robert, Gudla, Charan

arXiv.org Artificial Intelligence

This paper explores the application of Positive-Unlabeled (PU) learning for enhanced Distributed Denial-of-Service (DDoS) detection in cloud environments. Utilizing the BCCC-cPacket-Cloud-DDoS-2024 dataset, we implement PU learning with four machine learning algorithms: XGBoost, Random Forest, Support Vector Machine, and Naïve Bayes. Our results demonstrate the superior performance of ensemble methods, with XGBoost and Random Forest achieving F1 scores exceeding 98%. We quantify the efficacy of each approach using metrics including F1 score, ROC AUC, Recall, and Precision. This study bridges the gap between PU learning and cloud-based anomaly detection, providing a foundation for addressing Context-Aware DDoS Detection in multi-cloud environments. Our findings highlight the potential of PU learning in scenarios with limited labeled data, offering valuable insights for developing more robust and adaptive cloud security mechanisms.


A Comparative Analysis of Machine Learning Models for DDoS Detection in IoT Networks

Shakya, Sushil, Abbas, Robert

arXiv.org Artificial Intelligence

This paper addresses the detection of DDoS attacks in IoT networks using machine learning models. The rapid growth of IoT networks, many of which implement security procedures inconsistently, has made them highly susceptible to various forms of cyberattacks. The paper evaluates the efficacy of different machine learning models, such as XGBoost, K-Nearest Neighbours, Stochastic Gradient Descent, and Naïve Bayes, in distinguishing DDoS attacks from normal network traffic. Each model is assessed on several performance metrics, such as accuracy, precision, recall, and F1-score, to understand its suitability for real-time detection and response against DDoS threats. This comparative analysis identifies the strengths and weaknesses of each model with respect to the dynamic nature of IoT environments. The effectiveness of these models is analyzed, showing how machine learning can greatly enhance IoT security frameworks, offering adaptive, efficient, and reliable DDoS detection capabilities. These findings show the potential of machine learning in addressing the pressing need for robust IoT security solutions that can mitigate modern cyber threats and assure network integrity.
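The four metrics named in this abstract can be computed directly from true and predicted labels. The sketch below shows the standard definitions; the function name `classification_metrics` and the toy labels are assumptions for illustration and do not come from the paper.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return accuracy, precision, recall, f1


# Toy, made-up labels: 1 = DDoS traffic, 0 = normal traffic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
```

In this toy case one attack is missed and no normal flow is flagged, so precision is perfect while recall is not, which is the kind of trade-off that accuracy alone would hide.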


A Systematic Review of Machine Learning in Sports Betting: Techniques, Challenges, and Future Directions

Galekwa, René Manassé, Tshimula, Jean Marie, Tajeuna, Etienne Gael, Kyandoghere, Kyamakya

arXiv.org Artificial Intelligence

The sports betting industry has experienced rapid growth, driven largely by technological advancements and the proliferation of online platforms. Machine learning (ML) has played a pivotal role in the transformation of this sector by enabling more accurate predictions, dynamic odds-setting, and enhanced risk management for both bookmakers and bettors. This systematic review explores various ML techniques, including support vector machines, random forests, and neural networks, as applied in different sports such as soccer, basketball, tennis, and cricket. These models utilize historical data, in-game statistics, and real-time information to optimize betting strategies and identify value bets, ultimately improving profitability. For bookmakers, ML facilitates dynamic odds adjustment and effective risk management, while bettors leverage data-driven insights to exploit market inefficiencies. This review also underscores the role of ML in fraud detection, where anomaly detection models are used to identify suspicious betting patterns. Despite these advancements, challenges such as data quality, real-time decision-making, and the inherent unpredictability of sports outcomes remain. Ethical concerns related to transparency and fairness are also of significant importance. Future research should focus on developing adaptive models that integrate multimodal data and manage risk in a manner akin to financial portfolios. This review provides a comprehensive examination of the current applications of ML in sports betting, and highlights both the potential and the limitations of these technologies.


Towards improving Alzheimer's intervention: a machine learning approach for biomarker detection through combining MEG and MRI pipelines

Ahmad, Alwani Liyana, Sanchez-Bornot, Jose, Sotero, Roberto C., Coyle, Damien, Idris, Zamzuri, Faye, Ibrahima

arXiv.org Artificial Intelligence

MEG is a non-invasive neuroimaging technique with excellent temporal and spatial resolution, crucial for studying brain function in dementia and Alzheimer's disease. It identifies changes in brain activity at various Alzheimer's stages, including preclinical and prodromal phases. MEG may detect pathological changes before clinical symptoms, offering potential biomarkers for intervention. This study evaluates classification techniques using MEG features to distinguish between healthy controls (HC) and mild cognitive impairment (MCI) participants from the BioFIND study. We compare MEG-based biomarkers with MRI-based anatomical features, both independently and combined. We used 3 Tesla MRI and MEG data from 324 BioFIND participants: 158 MCI and 166 HC. Analyses were performed using MATLAB with SPM12 and OSL toolboxes. Machine learning analyses, including 100 Monte Carlo replications of 10-fold cross-validation, were conducted on sensor and source spaces. Combining MRI with MEG features achieved the best performance: 0.76 accuracy and AUC of 0.82 for GLMNET using LCMV source-based MEG. MEG-only analyses using LCMV and eLORETA also performed well, suggesting that combining uncorrected MEG with z-score-corrected MRI features is optimal.
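The evaluation protocol mentioned above (repeated replications of k-fold cross-validation) can be sketched as an index generator. This is only an illustration under stated assumptions: the authors worked in MATLAB, their exact splitting procedure (for example, whether folds were stratified by class) is not given here, and the function name `repeated_kfold_indices` is hypothetical. The sample count of 324 is the participant total from the abstract; the number of repeats is reduced to 3 to keep the example small.

```python
import random


def repeated_kfold_indices(n_samples, n_folds=10, n_repeats=100, seed=0):
    """Yield (repeat, fold, train_idx, test_idx), reshuffling the samples for each repeat."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(n_folds):
            # Every n_folds-th sample of the shuffled order goes to this fold's test set.
            test_idx = order[fold::n_folds]
            test_set = set(test_idx)
            train_idx = [i for i in order if i not in test_set]
            yield repeat, fold, train_idx, test_idx


splits = list(repeated_kfold_indices(n_samples=324, n_folds=10, n_repeats=3))
```

Averaging a metric over all repeats and folds reduces the dependence of the estimate on any single random partition.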